Abstract

The proliferation of the Open Archives Initiative Protocol for Metadata Harvesting
(OAI-PMH) has resulted in the creation of a large number of service providers, all 
harvesting from either data providers or aggregators. If data were available regarding 
the similarity of metadata records, service providers could track redundant records 
across harvests from multiple sources as well as provide additional end-user services. 
Due to the large number of metadata formats and the diverse mapping strategies 
employed by data providers, similarity calculation requirements necessitate the use of 
information retrieval strategies. We describe an OAI-PMH aggregator implementa- 
tion that uses the optional "<about>" container to re-export the results of similarity 
calculations. Metadata records (3751) were harvested from a NASA data provider 
and similarities for the records were computed. The results were useful for detecting 
duplicates, similarities and metadata errors. 

1 Introduction 



1.1 Problem Statement 

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a
succinct, six-verb protocol for the dissemination of metadata and represents the de facto
metadata exchange standard for today's digital libraries [2]. Yet with over six million
publicly available records, no general-purpose tools exist to aid service providers in
discovering similarity among harvested records.

OAI-PMH compounds the venerable problem of duplicates in union catalogues.
Figures 1 and 2 depict scenarios where similar or duplicate records could be harvested.
Figure 1 depicts a scenario where a service provider has twice harvested from Data
Provider 2. Perhaps Aggregator 1 or Aggregator 2 has performed some metadata
normalization, resulting in new OAI identifiers being assigned. "Sameness" becomes
harder to compute. The <provenance> records can help identify duplicates, but they
are optional. Figure 2 shows a service provider harvesting from two separate data
providers. Both data providers have reviews of DJ Shadow LPs. Are the two reviews
of "Endtroducing" the same, or are they merely similar?

Figure 1: A possible scenario where duplicate records may be harvested via separate
aggregators.

Figure 2: A possible scenario where duplicate records may be harvested through multiple
descriptions of the same resource.



1.2 Proposed Solution 

A general purpose system devised to measure the similarity of OAI metadata records 
using information retrieval (IR) methodologies has many valuable uses. It may be 
used to weed out duplicate records that might otherwise be difficult to find by more 
traditional field matching methodologies. Also, it could be used to find additional 
versions of the same work, such as locating the short and long paper versions of the 
same project. Finally, this system could find similar documents in accordance with a 
predetermined threshold for use in recommendation systems such as those described 
in P. 

Such a "similarity engine" has been built to calculate similarity among harvested
metadata records and thereby detect duplicate or similar documents. Our OAI-PMH
aggregator incorporates these results, appending similarity data to metadata record
requests. When issued a "GetRecord" request, the aggregator returns the standard
metadata for the identifier along with a ranked list of similar documents. Ranking values are
between 0 and 1, where 1 represents absolute similarity. This work is intended as a
proof of concept and will require further optimization before dealing with the open
corpus of OAI-PMH metadata (over 6.5 million records).

The Vector Space Model (VSM) [4] computes the cosine similarity of documents
based on the common terms present in the documents and their approximate
importance. The importance of a term is gauged by its frequency of occurrence within
a document and its sparse presence in the rest of the document collection. We use
VSM as the primary mechanism for computing the similarity among the metadata.
Salton describes the formula used as the best fully weighted system and recommends
its use in the processing of text abstracts [1], which is similar to the text found in a
metadata record. Our aggregator returns the identifier similarities within the context
of the standard OAI-PMH "GetRecord" request. A service provider could use this
data to further process its records (e.g., creating links to similar documents, deleting
them, logging them for examination, etc.).
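
The weighting and comparison described above can be illustrated with a minimal
sketch. It assumes whitespace tokenization and a plain tf * log(N / df) weight with
cosine normalization; the function names and sample records are hypothetical, and
the exact weighting variant used by the similarity engine may differ.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    n_docs = len(docs)
    term_counts = [Counter(text.lower().split()) for text in docs]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())
    return [
        {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in counts.items()}
        for counts in term_counts
    ]


def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical metadata snippets, not taken from the LTRS harvest.
records = [
    "buckling of laminated thin shallow shells",
    "buckling of laminated thin elastic shallow shells",
    "aerosol measurement workshop laboratory phase",
]
vecs = tfidf_vectors(records)
sim_close = cosine(vecs[0], vecs[1])  # records sharing most terms
sim_far = cosine(vecs[0], vecs[2])    # records sharing no terms
print(sim_close, sim_far)
```

Terms that occur in every document receive a weight of zero under this scheme, which
is how the makeup of the collection influences each record's vector.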



2 Methodology 

For proof-of-concept purposes, NASA's Langley Technical Reports Server (LTRS)
[2] served as the repository to generate our test collection of 3751 records. To keep
interface complexities to a minimum, only a predetermined number (ten) of matches
(the top similar documents and their similarity scores) are returned and appended to the
GetRecord response. Currently this number is set by our aggregator, but it could easily
be passed by the client as well. The following URL could yield the result shown in
Figure 3:

http://128.82.7.113:5180/perl/NASA_ltrs/?verb=GetRecord&metadataPrefix=oai_dc&identifi<
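
As a rough illustration of how a client might assemble such a request, the sketch
below builds a GetRecord URL from the three standard OAI-PMH arguments. The base
URL and identifier are placeholders (assumptions), not the aggregator's actual
address or a real LTRS identifier.

```python
from urllib.parse import urlencode


def get_record_url(base_url, identifier, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord request URL (arguments are percent-encoded)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "metadataPrefix": metadata_prefix,
            "identifier": identifier,
        }
    )
    return f"{base_url}?{query}"


# Hypothetical endpoint and identifier, for illustration only.
url = get_record_url("http://aggregator.example.org/oai", "oai:example.org:record-1")
print(url)
```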



Use of the VSM permits the makeup of the record collection to affect the im- 
portance of each term in a given document. Given the large number of OAI-PMH 
records that would ultimately be harvested and evaluated, it would be extraordinarily 
costly to compare the full text documents. Instead, the metadata records are used. 
Individual service providers could be relieved of the expensive burden of calculating 
record similarity and may instead query (and even re-harvest) this data computed by 
the similarity engine. 

Harvested records are cached in a hierarchical file structure. This structure is
duplicated to store each file's term list and frequencies (in a tf_metadata directory) and
again to store each file's term weights (in weight_metadata) after idf has been
calculated. Since additional harvests will add records, they will require
recalculating the collection idf weights and similarities. Note that idf values are not cached,
as these values are calculated at runtime. Once the tf_metadata directory is created,
it can be reused for subsequent collection calculations. This does not hold for the
weight_metadata files, because their values depend on the runtime idf weight
calculations.
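
The distinction between the reusable term-frequency cache and the weights that must
be recomputed can be sketched as follows. This is an in-memory illustration under
assumed names (tf_cache, idf_table, weights); the on-disk layout of the actual
directories is not reproduced here.

```python
import math
from collections import Counter


def idf_table(tf_cache):
    """Recompute idf for every term from the cached term frequencies."""
    n_docs = len(tf_cache)
    doc_freq = Counter()
    for counts in tf_cache.values():
        doc_freq.update(counts.keys())
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def weights(tf_cache):
    """Derive term weights; these depend on idf, so they change with the collection."""
    idf = idf_table(tf_cache)
    return {
        doc_id: {term: tf * idf[term] for term, tf in counts.items()}
        for doc_id, counts in tf_cache.items()
    }


# Hypothetical cached term frequencies for two harvested records.
tf_cache = {
    "rec-1": Counter("wing lift wing".split()),
    "rec-2": Counter("wing tire".split()),
}
before = weights(tf_cache)

# A later harvest adds a record: the cached counts stay valid, the weights do not.
tf_cache["rec-3"] = Counter("tire contact".split())
after = weights(tf_cache)
print(before["rec-1"]["wing"], after["rec-1"]["wing"])
```

In this toy collection the term "wing" carries no weight while it appears in every
record, and gains weight once a record without it is harvested.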
3 Results



LTRS was harvested in the first week of April 2003, yielding 3751 records. The
similarity engine calculated the similarities of these 3751 records in just over 7 hours. This
total runtime includes document term parsing, document term frequency calculation,
idf, term weights for each document and the similarity calculations per document.
Dividing this by the number of pairwise comparisons (7,033,125), we can estimate
the cost of a single document-to-document comparison at approximately 0.0036 seconds.

Given this cost per similarity calculation, we can get an idea of how this would
scale. Comparing every document to all other documents is an O(n^2)
operation, even if we calculate only the upper triangular portion of the
document-to-document matrix, O((n^2 - n)/2). Reducing this order of complexity
is beyond the scope of this proof of concept, although we are currently investigating
other techniques. The estimated similarity computation time for an increasing
number of records is shown in Table 1.
Figure 3: Sample GetRecord response matching identifier.



Number of Records    Estimated Computation Time
100                  17 seconds
1000                 30 minutes-1 second
5000                 12 hours-31 minutes-18 seconds
10000                2 days-2 hours-5 minutes-30 seconds
25000                13 days-1 hour-5 minutes-31 seconds
50000                52 days-4 hours-23 minutes-36 seconds
100000               208 days-17 hours-37 minutes-24 seconds
1000000              57 years-68 days-14 hours-51 minutes-36 seconds
6500000              2416 years-71 days-3 hours-45 minutes-2 seconds



Table 1: Estimated similarity computation time for an increasing number of records. 
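
The arithmetic behind these estimates can be sketched as below. The sketch assumes
the rounded per-comparison cost of 0.0036 seconds quoted in the text, so its output
differs slightly from the table, which appears to have been computed from an
unrounded cost. The function names are illustrative.

```python
def pair_count(n):
    """Number of distinct document pairs: the upper triangle, (n^2 - n) / 2."""
    return n * (n - 1) // 2


def estimated_seconds(n, seconds_per_pair=0.0036):
    """Estimated time to compare every pair among n records."""
    return pair_count(n) * seconds_per_pair


def split_duration(seconds):
    """Split whole seconds into (days, hours, minutes, seconds)."""
    total = int(seconds)
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return days, hours, minutes, secs


for n in (100, 1000, 3751, 10000, 100000):
    d, h, m, s = split_duration(estimated_seconds(n))
    print(f"{n:>7} records: {pair_count(n):>13} pairs, about {d}d {h}h {m}m {s}s")
```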



Similarity           Identifier
0.612268646499951    oai:ltrs.larc.nasa.gov:aiaa-90-3435
0.437324585723922    oai:ltrs.larc.nasa.gov:aiaa-89-3525
0.192198368428425    oai:ltrs.larc.nasa.gov:NASA-99-tm209123
0.191714168117722    oai:ltrs.larc.nasa.gov:conf-53-nato-agard
0.181853507628537    oai:ltrs.larc.nasa.gov:aiaa-92-4541
0.146374107816182    oai:ltrs.larc.nasa.gov:tp4440.tex
0.139737036684787    oai:ltrs.larc.nasa.gov:aiaa-92-4145
0.124248293995541    oai:ltrs.larc.nasa.gov:aiaa-89-3312
0.115284832380858    oai:ltrs.larc.nasa.gov:NASA-2001-tm210644
0.102123657461834    oai:ltrs.larc.nasa.gov:conf-10-dasc



Table 2: Top ten matches in LTRS collection to file oai:ltrs.larc.nasa.gov:9-dasc. 

The computed results were written to the file "similarities.txt". The top ten
collection matches and the nature of their similarity are presented in Appendix 2. To
further process the results, a script was written which read "similarities.txt" and
created a directory of results. Each file in it was named after a document and contained
the top 10 (user definable) closest matches found, sorted high to low (Table 2). With
this built, the results were ready to be exported.
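
The post-processing step can be sketched roughly as follows. The layout of
"similarities.txt" is not given here, so the sketch assumes each computed pair is
available as a (document A, document B, score) tuple; the function name and sample
data are hypothetical.

```python
import heapq
from collections import defaultdict


def top_matches(pairs, k=10):
    """Map each document to its k closest matches, sorted high to low."""
    by_doc = defaultdict(list)
    for doc_a, doc_b, score in pairs:
        # Each upper-triangle pair contributes a match to both documents.
        by_doc[doc_a].append((score, doc_b))
        by_doc[doc_b].append((score, doc_a))
    return {doc: heapq.nlargest(k, matches) for doc, matches in by_doc.items()}


# Hypothetical pairwise scores, for illustration only.
pairs = [("a", "b", 0.9), ("a", "c", 0.2), ("b", "c", 0.5)]
results = top_matches(pairs, k=2)
for doc, matches in sorted(results.items()):
    print(doc, matches)
```

Writing one output file per document then amounts to iterating over the returned
mapping.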

A fully functional OAI-PMH compliant aggregator was built that used the
harvested records as its repository. The data provider was extended
to serve the similarity information along with the other metadata normally provided.
This modification was done within the scope of the OAI-PMH, thus maintaining
OAI-PMH compliance. The similarity data is housed within an <about> section, which is
available as an optional part of a GetRecord response. An XML schema was written
for this container, as per OAI-PMH specifications (Appendix 1).
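
A minimal sketch of assembling such an <about> container is given below. The element
and attribute names (similar, masterDoc, match, similarity) follow the schema in
Appendix 1; namespaces are omitted for brevity, and the identifiers and scores are
placeholders rather than harvested values.

```python
import xml.etree.ElementTree as ET


def build_about(master_id, matches):
    """Build an <about> element holding a ranked list of similar records."""
    about = ET.Element("about")
    similar = ET.SubElement(about, "similar")
    ET.SubElement(similar, "masterDoc").text = master_id
    for identifier, score in matches:
        # The match identifier is the element text; the score is a required attribute.
        match = ET.SubElement(similar, "match", similarity=repr(score))
        match.text = identifier
    return about


# Hypothetical identifiers and scores, for illustration only.
about = build_about(
    "oai:example.org:doc-1",
    [("oai:example.org:doc-2", 0.61), ("oai:example.org:doc-3", 0.44)],
)
xml_text = ET.tostring(about, encoding="unicode")
print(xml_text)
```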

Each similarity calculation between two documents took approximately
0.0036 seconds on average. The results of the 7,033,125 such calculations were stored in
a 678.1 MB file (98.7 MB if compressed). The harvested LTRS
collection, tf_metadata and weight_metadata occupied 15.2 MB, 15.2 MB and 17.3 MB
of disk space, respectively.






4 Conclusions 



We have demonstrated a proof of concept showing the use of information retrieval
technologies with the OAI-PMH. This cross-fertilization yields valuable results. At
the data provider, similarity results may be used to detect duplicates within the
collection as well as to locate related documents (e.g., the Vol 1 / Vol 2 scenario).
This information can be used for further grouping and association making. Through
the interface demonstrated, the user is provided with the top ten (in our case) matches.
A harvester could use this information to create links to these associated documents
for the end user to peruse. Such links could direct users to alternative versions of
documentation or to subsequent parts of the same report.
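
On the harvester side, this use of the similarity data can be sketched as follows.
The sample <about> fragment, the 0.9 duplicate threshold and the function name are
all assumptions made for illustration; an appropriate threshold would have to be
chosen per collection.

```python
import xml.etree.ElementTree as ET

# Hypothetical <about> fragment (namespaces omitted), not real aggregator output.
SAMPLE_ABOUT = (
    "<about><similar>"
    "<masterDoc>oai:example.org:doc-1</masterDoc>"
    '<match similarity="0.98">oai:example.org:doc-2</match>'
    '<match similarity="0.31">oai:example.org:doc-3</match>'
    "</similar></about>"
)


def classify_matches(about_xml, duplicate_threshold=0.9):
    """Split matches into likely duplicates and merely related records."""
    root = ET.fromstring(about_xml)
    likely_duplicates, related = [], []
    for match in root.iter("match"):
        score = float(match.get("similarity"))
        target = likely_duplicates if score >= duplicate_threshold else related
        target.append((match.text, score))
    return likely_duplicates, related


duplicates, related = classify_matches(SAMPLE_ABOUT)
print("possible duplicates:", duplicates)
print("related records:", related)
```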

This project is only the beginning of a greater investigation into the similarity
evaluation of OAI-PMH metadata records. The algorithm used here is O(n^2), which
poses great scaling issues. While there are many optimizations which may still be
made, including parallelizing the calculations through a distributed calculation
mechanism, in the end another algorithm must be employed.
Appendix 1. XML schema for similarity in <about>
container



<?xml version="1.0" encoding="UTF-8"?>
<!-- W3C Schema generated by XMLSPY, modified by Terry Harrison -->
<xs:schema targetNamespace="D:\Terry\aboutSchema4\"
           xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="D:\Terry\aboutSchema4\"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified">
  <xs:annotation>
    <xs:documentation>This XML Schema can be used to validate the "about" section in an
      OAI-PMH document. Terry L. Harrison. April 22, 2003.</xs:documentation>
  </xs:annotation>
  <xs:element name="similar">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="masterDoc" maxOccurs="1" type="xs:string"/>
        <xs:element ref="match" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="match">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attribute name="similarity" type="xs:double" use="required"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>
Appendix 2. Top 10 similar document pairs in
LTRS collection



Rank: 1
Titles of Similar Documents:
  Space Environmental Effects on Spacecraft: LEO Materials Selection Guide
  (oai:ltrs.larc.nasa.gov:NASA-95-cr4661pt1)
  Space Environmental Effects on Spacecraft: LEO Materials Selection Guide
  (oai:ltrs.larc.nasa.gov:NASA-95-cr4661pt2)
Similarity: 0.9822
Relatedness: Part 1 and Part 2 of same report.

Rank: 2
Titles of Similar Documents:
  Nondimensional parameters and equations for Buckling of Symmetrically Laminated
  Thin Elastic Shallow Shells (oai:ltrs.larc.nasa.gov:NASA-90-tm102716)
  Nondimensional parameters and equations for Buckling of Symmetrically Laminated
  Thin Elastic Shallow Shells (oai:ltrs.larc.nasa.gov:NASA-91-tm104060)
Similarity: 0.980
Relatedness: Suspected metadata error.

Rank: 3
Titles of Similar Documents:
  Development of Pneumatic Channel Wing Powered-Lift Advanced Super-STOL Aircraft
  (oai:ltrs.larc.nasa.gov:NASA-aiaa-2002-2929)
  Pneumatic Channel Wing Powered-Lift Advanced Super-STOL Aircraft
  (oai:ltrs.larc.nasa.gov:NASA-aiaa-2002-3275)
Similarity: 0.9617
Relatedness: Same paper (or close match) submitted to two conferences having
  different identifiers.

Rank: 4
Titles of Similar Documents:
  Compendium of NASA Data Base for the Global Tropospheric Experiment's Pacific
  Exploratory Mission-Tropics B (PEM-Tropics B), Volume 1: DC-8.
  (oai:ltrs.larc.nasa.gov:NASA-2000-tm210617vol1)
  Compendium of NASA Data Base for the Global Tropospheric Experiment's Pacific
  Exploratory Mission-Tropics B (PEM-Tropics B), Volume 2: P-3B.
  (oai:ltrs.larc.nasa.gov:NASA-2000-tm210617vol2)
Similarity: 0.958
Relatedness: Volume One and Volume Two of same report.

Rank: 5
Titles of Similar Documents:
  Computational Methods for Frictional Contact With Applications to the Space
  Shuttle Orbiter Nose-Gear Tire-Development of Frictional Contact Algorithm.
  (oai:ltrs.larc.nasa.gov:NASA-96-tp3573)
  Computational Methods for Frictional Contact With Applications to the Space
  Shuttle Orbiter Nose-Gear Tire-Development of Frictional Contact Algorithm.
  (oai:ltrs.larc.nasa.gov:NASA-96-tp3574)
Similarity: 0.956
Relatedness: Suspected metadata error.

Rank: 6
Titles of Similar Documents:
  A Cryogenic Magnetostrictive Actuator Using a Persistent High Temperature
  Superconducting Magnet, Part 1: Concept and Design.
  (oai:ltrs.larc.nasa.gov:NASA-2000-tm209139)
  A Cryogenic Magnetostrictive Actuator Using a Persistent High Temperature
  Superconducting Magnet, Part 1: Concept and Design.
  (oai:ltrs.larc.nasa.gov:NASA-99-6spie-gchl)
Similarity: 0.950
Relatedness: Technical Report version of a conference paper.

Rank: 7
Titles of Similar Documents:
  International Space Station Evolution Data Book-Volume 1.
  (oai:ltrs.larc.nasa.gov:NASA-2000-sp6109vol1rev1)
  International Space Station Evolution Data Book-Volume 2 Evolution
  Concepts-Revision A. (oai:ltrs.larc.nasa.gov:NASA-2000-sp6109vol2rev1)
Similarity: 0.937
Relatedness: Volume One and Volume Two of same report.

Rank: 8
Titles of Similar Documents:
  A Model for Assessing the Liability of Seemingly Correct Software.
  (oai:ltrs.larc.nasa.gov:NASA-96-icai.jlr)
  A Model for Assessing the Liability of Seemingly Correct Software.
  (oai:ltrs.larc.nasa.gov:NASA-96-tm110247)
Similarity: 0.935
Relatedness: Technical Report version of a conference paper.

Rank: 9
Titles of Similar Documents:
  NASA's Atmospheric Effects of Aviation Project - Result of the August 1999
  Aerosol Measurement Intercomparison Workshop, Laboratory Phase.
  (oai:ltrs.larc.nasa.gov:NASA-96-iastcd.jmv)
  NASA's Atmospheric Effects of Aviation Project - Result of the August 1999
  Aerosol Measurement Intercomparison Workshop, T-38 Aircraft Sampling Phase.
  (oai:ltrs.larc.nasa.gov:NASA-96-icrqcr-jmv)
Similarity: 0.935
Relatedness: Different phases of same project.

Rank: 10
Titles of Similar Documents:
  Integrated Orbit, Attitude and Structural Control Systems Design for Space
  Solar Power Satellites. (oai:ltrs.larc.nasa.gov:NASA-2001-tm210829)
  Integrated Orbit, Attitude and Structural Control Systems Design for Space
  Solar Power Satellites. (oai:ltrs.larc.nasa.gov:NASA-2001-tm111226)
Similarity: 0.928
Relatedness: Suspected metadata error.